Analysis of the WNV dataset from https://raw.githubusercontent.com/jdrakephd/ideas-workshop/master/wnv.csv.
Let’s examine our data to start -
head(wnv)
## # A tibble: 6 x 9
## State Year EncephMen Fever Other Total Fatal Latitude Longitude
## <chr> <int> <int> <int> <int> <int> <int> <dbl> <dbl>
## 1 New York 1999 59 3 0 62 7 42.5 -75.3
## 2 Connecticut 2000 0 1 0 1 0 41.5 -72.8
## 3 New Jersey 2000 5 1 0 6 1 40.2 -74.7
## 4 New York 2000 14 0 0 14 1 42.5 -75.3
## 5 Alabama 2001 2 0 0 2 1 32.3 -86.9
## 6 Connecticut 2001 6 0 0 6 1 41.5 -72.8
wnv %>% select(-State) %>% ggpairs(.)
How do the cases per year change?
Although hard to read, we can see a definite change in both cases per year and a difference in cases per state.
WNV can cause meningitis, but not in all cases.
What is the neuroinvasive disease rate of WNV, and how does this differ by state?
#First, let us calculate the neuroinvasive disease rate:
wnv <- mutate(wnv, ndr = EncephMen / Total)
#Then, we define two functions to calculate the Standard Error (se) and Mean of the ndr:
#function for finding mean
fun_mean <- function(x) {
l <- length(x)
s <- sum(x)
m <- s / l
return(m)
}
#function for finding standard error
fun_se <- function (x){
se <- sd(x) / sqrt(length(x))
return(se)
}
#Then, we can plot the mean ndr per state and include the standard error:
wnv %>%
group_by(State) %>%
summarise(ndr_mean = fun_mean(ndr), ndr_se=fun_se(ndr)) %>%
ggplot(aes(x = State, y = ndr_mean)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = ndr_mean - ndr_se,
ymax = ndr_mean + ndr_se)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 2 rows containing missing values (geom_errorbar).
Is there a location effect on WNV death rates?
We can check!
To start, we’ll define the case fatality rate (CFR) -
wnv %>%
ungroup() %>%
group_by(State, Year) %>%
summarise(CFR = (Fatal / Total) * 100)
## # A tibble: 272 x 3
## # Groups: State [?]
## State Year CFR
## <chr> <int> <dbl>
## 1 Alabama 2001 50.0
## 2 Alabama 2002 6.12
## 3 Alabama 2003 8.11
## 4 Alabama 2004 0.
## 5 Alabama 2005 20.0
## 6 Alabama 2006 0.
## 7 Alabama 2007 12.5
## 8 Arizona 2003 7.69
## 9 Arizona 2004 4.09
## 10 Arizona 2005 4.42
## # ... with 262 more rows
wnv$CFR <- (wnv$Fatal / wnv$Total) * 100
head(wnv)
## # A tibble: 6 x 11
## State Year EncephMen Fever Other Total Fatal Latitude Longitude ndr
## <chr> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 New Yo~ 1999 59 3 0 62 7 42.5 -75.3 0.952
## 2 Connec~ 2000 0 1 0 1 0 41.5 -72.8 0.
## 3 New Je~ 2000 5 1 0 6 1 40.2 -74.7 0.833
## 4 New Yo~ 2000 14 0 0 14 1 42.5 -75.3 1.00
## 5 Alabama 2001 2 0 0 2 1 32.3 -86.9 1.00
## 6 Connec~ 2001 6 0 0 6 1 41.5 -72.8 1.00
## # ... with 1 more variable: CFR <dbl>
We’ll first look at the difference between Eastern and Western states. We’ll classify each state as either ‘East’ or ‘West’, depending on their location in relation to the mean latitude of the states.
wnv$east_west <- ifelse(wnv$Longitude < mean(wnv$Longitude), "West", "East")
table(wnv$east_west)
##
## East West
## 151 121
Then, we can group_by the state’s location, and test to see if there is a difference in rates:
wnv %>%
group_by(east_west) %>%
summarise(CFR_mean = fun_mean(CFR), CFR_se=fun_se(CFR))
## # A tibble: 2 x 3
## east_west CFR_mean CFR_se
## <chr> <dbl> <dbl>
## 1 East 7.28 0.766
## 2 West 3.91 0.354
#non-parametric test needed
hist(wnv$CFR)
wilcox.test(wnv$CFR ~ wnv$east_west)
##
## Wilcoxon rank sum test with continuity correction
##
## data: wnv$CFR by wnv$east_west
## W = 10530, p-value = 0.02803
## alternative hypothesis: true location shift is not equal to 0
What about North vs South?
ggplot(wnv, aes(x = Latitude, y = CFR)) +
geom_point()
ggplot(wnv, aes(x = Latitude, y = CFR)) +
geom_smooth(method = "gam")
summary(lm(CFR ~ Latitude, data = wnv))
##
## Call:
## lm(formula = CFR ~ Latitude, data = wnv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.898 -4.935 -2.031 2.102 60.867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.59437 3.85366 3.528 0.000493 ***
## Latitude -0.20065 0.09821 -2.043 0.042011 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.611 on 270 degrees of freedom
## Multiple R-squared: 0.01523, Adjusted R-squared: 0.01158
## F-statistic: 4.174 on 1 and 270 DF, p-value: 0.04201
Let’s look at the North vs South plot in more detail -
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`